5. Main Analysis - All
Provide a detailed, well-organized description of your findings, including textual description, graphs, and code. Your focus should be on both the results and the process. Include, as reasonable and relevant, approaches that didn’t work, challenges, the data cleaning process, etc. . The guidelines for the Executive Summary above do NOT apply to exploratory data analysis. Your main concern is designing graphs that reveal patterns and trends. . As noted in Hmk #4, do not use circles, that is: bubbles, pie charts, or polar coordinates. . Use stacked bar charts sparingly. Try grouped bar charts and faceting as alternatives, and only choose stacked bar charts if they truly do a better job than the alternatives for observing patterns.
Zip Code, Neighborhood and Land Value Data - Adam
Sidewalk Cafe License Data - Marika
Any business that operates a portion of a restaurant on a public sidewalk must obtain a Sidewalk Cafe License from New Yor City. These licenses must be renewed every two years and fall into three categories: enclosed, unenclosed, or small unenclosed sidewalk cafes.
Analyzing by Borough
First, to help better organize the sidewalk cafe licenses by borough, I added a new column called BOROUGH that is set to MANHATTAN, BROOKLYN, BRONX, or QUEENS. I had to manually check that only the cities in Queens had been called out specifically in the CITY column, so it was easy to distinguish them from BRONX or BROOKLYN.
sidewalks <- sidewalks %>% mutate(BOROUGH = ifelse(CITY=="NEW YORK"|CITY=="New York","MANHATTAN",ifelse(CITY=="BROOKLYN","BROOKLYN",ifelse(CITY=="BRONX","BRONX","QUEENS"))))
To get a better understanding of the distribution of these licenses, I have provided a bar graph by borough.
ggplot(sidewalks, aes(x=fct_infreq(BOROUGH)))+geom_bar(aes(fill=BOROUGH))+ggtitle("Frequency of Sidewalk Cafe Licenses by Borough")+xlab("Borough")+ylab("Frequency")
Clearly Manhattan has the most license requests, followed by Brooklyn, then Queens and finally Bronx. Since at the moment we don’t have neighborhood information (everything in Manhattan is just classified as New York, Brooklyn has only Brooklyn, and Bronx has only the city of Bronx), we can only dive into the Queens data:
queens_cafes <- sidewalks %>% filter(BOROUGH=="QUEENS")
ggplot(queens_cafes, aes(x=fct_infreq(CITY)))+geom_bar(fill="purple")+ggtitle("Frequency of Sidewalk Cafe Licenses in Queens")+xlab("City / Neighborhood")+ylab("Frequency")+coord_flip()
In queens, a large percentage of license requests come from Astoria, followed by Long Island City and Forest Hills.
Next, in order to do date comparisons to ascertain which are the new applications vs. renewal applications, I had to convert certain date fields from strings (they were read in as string factors) into dates.
sidewalks$EXPIRATION_DATE<-as.Date(sidewalks$EXPIRATION_DATE, format="%m/%d/%Y")
sidewalks$APP_STATUS_DATE<-as.Date(sidewalks$APP_STATUS_DATE, format="%m/%d/%Y")
sidewalks$SUBMIT_DATE<-as.Date(sidewalks$SUBMIT_DATE, format="%m/%d/%Y")
Licenses by Status
The list of licenses includes active licenses, expired licenses, licenses for businesses that have closed (and are now inactive), licenses which are up for renewal as part of the two year process, or new requests for licenses. To better classify them, I created a new field called STATUS_CLASSIFICATION. Those licenses which are still active and not up for renewal are classified as “ACTIVE”. Those licenses that have been submitted for renewal (either because their expiration date is less than the latest application data, or that an active license is up for review) are classified as “RENEWAL”. Those licenses that are in the sheet but do not have a license number are classified as “NEW”, and the rest are marked as “OLD” to encompass inactive licenses that have not been acted upon.
sidewalks<-sidewalks %>% mutate(STATUS_CLASSIFICATION = ifelse(LIC_STATUS=="Active" & (APP_STATUS=="Application Approved" | APP_STATUS=="Application Review Completed"),"ACTIVE",ifelse(is.na(LICENSE_NBR),"NEW",ifelse((APP_STATUS_DATE>EXPIRATION_DATE | DPQA=="Issued Temp Op Letter") | (LIC_STATUS=="Active" & (APP_STATUS=="Pending Review" | APP_STATUS=="Submitted")),"RENEWAL","OLD"))))
Now that we have classified the status of the licenses, we are able to see how these classifications differ between the boroughs.
ggplot(sidewalks)+geom_mosaic(aes(x=product(STATUS_CLASSIFICATION,BOROUGH),fill=factor(STATUS_CLASSIFICATION)))+coord_flip()+labs(x="Borough",y="License Status", fill="License Designation")+ggtitle("Boroughs by License Status")
The mosaic plot shows how Bronx and Brooklyn may be getting more new license requests as a percentage of total licenses. Bronx is also getting the highest percentage of renewal requests out of its inactive and active licenses. We can also take a look at the license designations by borough:
ggplot(sidewalks)+geom_mosaic(aes(x=product(BOROUGH,STATUS_CLASSIFICATION),fill=factor(BOROUGH)))+coord_flip()+labs(x="License Status",y="Borough", fill="Borough")+ggtitle("License Status by Borough")
Looking at the data in this way, you can see how Brooklyn has the second-most new license requests, but how Manhattan still dominates in all license status categories.
Mapping Licenses
We can map the data to have a better view of where the datapoints lie. To get an overall picture, I selected a map centered on Long Island City in Queens so that we can get a good view of both Brooklyn and Bronx in addition to Manhattan.
map <- get_map( location = c(-73.9485424, 40.7454513), source = "stamen", zoom = 11, maptype="toner")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.745451,-73.948542&zoom=11&size=640x640&scale=2&maptype=terrain&sensor=false
Map from URL : http://tile.stamen.com/toner/11/602/768.png
cannot remove file 'fa373044657f7c62d16741b55e210dc9.rds', reason 'No such file or directory'Map from URL : http://tile.stamen.com/toner/11/603/768.png
cannot remove file 'b7bd091d8547e6622f48ce421780b6bc.rds', reason 'No such file or directory'Map from URL : http://tile.stamen.com/toner/11/604/768.png
cannot remove file 'e9000e2559ec9b9ab2bae536d0f686ee.rds', reason 'No such file or directory'Map from URL : http://tile.stamen.com/toner/11/602/769.png
cannot remove file 'f6db3397d509bef83f2cc46c94f6c8e2.rds', reason 'No such file or directory'Map from URL : http://tile.stamen.com/toner/11/603/769.png
cannot remove file '0515d6ca301983631471342b2dbde1fd.rds', reason 'No such file or directory'Map from URL : http://tile.stamen.com/toner/11/604/769.png
cannot remove file 'a3b878f5715905f01706e8049b1ab8b2.rds', reason 'No such file or directory'Map from URL : http://tile.stamen.com/toner/11/602/770.png
cannot remove file 'b29acb090826de9776bece6909cea343.rds', reason 'No such file or directory'Map from URL : http://tile.stamen.com/toner/11/603/770.png
cannot remove file 'c20725c8ef636509f5c421172718b0f1.rds', reason 'No such file or directory'Map from URL : http://tile.stamen.com/toner/11/604/770.png
cannot remove file 'c1e025a7c508655340728d50251a5969.rds', reason 'No such file or directory'Map from URL : http://tile.stamen.com/toner/11/602/771.png
cannot remove file '44a64c17db3c40de1f8e0ccb7e734c75.rds', reason 'No such file or directory'Map from URL : http://tile.stamen.com/toner/11/603/771.png
cannot remove file 'd7cfdbf7ca6593ec5424eb027c4828b2.rds', reason 'No such file or directory'Map from URL : http://tile.stamen.com/toner/11/604/771.png
cannot remove file 'cb84db7e40718a6f35709ed846158ce4.rds', reason 'No such file or directory'
Plotting each of the restaurants colored by their borough. You can see how Manhattan dominates in the number of sidewalk cafes, and how the sidewalk cafes in Brooklyn and Queens are largely concentrated in the areas closer to Manhattan.
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.3)+ggtitle("Distribution of all licenses (inactive or active) in NYC")
Even with alpha, it is difficult to tell exactly where the largest concentrations of sidewalk cafes lie. Using a density plot, we can better see the concentration of sidewalk cafes around mid-to lower Manhattan, and the clusters in Astoria and Williamsburg.
g <- ggmap(map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks, geom="polygon", alpha=0.2)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in NYC")
The next question we can ask is whether there are clear patterns to where new license requests are coming in from, where they are being renewed, or where they have expired without renewal.
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.4)+facet_wrap(~STATUS_CLASSIFICATION)+ggtitle("Sidewalk Cafe Licenses by Status")
However, since we are zoomed out a lot and looking at the entire set of licenses, it is difficult to tell the difference between the distributions of license requests. For this analysis, I am only concerned about active licenses (whether they are being renewed or not), and new requests that have not yet been approved. To do this, I created another STATUS_CLASSIFICATION that groups active renewal requests into the Active category, and sets all other non-new requests as “inactive”.
sidewalks <- sidewalks %>% mutate(STATUS_CLASSIFICATION2=ifelse(STATUS_CLASSIFICATION=="ACTIVE" | (STATUS_CLASSIFICATION=="RENEWAL" & LIC_STATUS=="Active"), "ACTIVE", ifelse(STATUS_CLASSIFICATION=="NEW","NEW","OLD")))
new_active <- sidewalks %>% filter(STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW")
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active, alpha=0.3)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in New York City")
Since we are quite zoomed out, it is difficult to see what exactly is happening with the New requests. However, if we zoomed in, we would lose information about the Bronx or lower Brooklyn. In order to determine whether there is any useful information there, I have zoomed in on the Brooklyn region.
Geographic Analysis - Bronx
bronx_map <- get_map("Bronx, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Bronx,+NY&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Bronx,%20NY&sensor=false
ggmap(bronx_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in the Bronx")
Since there are very few active or new licenses in the Bronx, I feel comfortable in zooming in on the rest of Manhattan / Brooklyn and Queens in order to be able to better see what is happening at the expense of Bronx.
Geographic Analysis - Manhattan
I want to take a closer look at what is happening in Manhattan. In order to do this at a more granular level, I will use my merged data with our master_zip document which lists the neighborhoods of each Borough by zip code. I also pulled out the year of the submission of the license application and saved it as SUBMIT_YEAR.
master_zip<-read.csv("Data/zips_master_no_missing_nbrh.csv", strip.white=TRUE)
sidewalks_nbh <- merge(sidewalks, master_zip, by="ZIP", all.x=TRUE)
sidewalks_nbh <- sidewalks_nbh %>% mutate(SUBMIT_YEAR=year(SUBMIT_DATE))
sidewalks_manhattan_active <- sidewalks_nbh %>% filter(BOROUGH=="MANHATTAN" & (STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW"))
sidewalks_brooklyn_active <- sidewalks_nbh %>% filter(BOROUGH=="BROOKLYN" & (STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW"))
sidewalks_bronx_active <- sidewalks_nbh %>% filter(BOROUGH=="BRONX" & (STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW"))
sidewalks_queens_active <- sidewalks_nbh %>% filter(BOROUGH=="QUEENS" & (STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW"))
manhattan_map <- get_map("Manhattan, NY", source="google", maptype="roadmap", zoom=12, color="bw")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manhattan,+NY&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manhattan,%20NY&sensor=false
ggmap(manhattan_map)+geom_point(aes(x=LONGITUDE, y=LATITUDE, color=nbh), data=sidewalks_manhattan_active)+ggtitle("All active or new sidewalk cafes in Manhattan")
manhattan_names <- sidewalks_manhattan_active %>% distinct(nbh, latitude, longitude) %>% group_by(nbh) %>% summarise(mean_lon=mean(longitude), mean_lat=mean(latitude))
brooklyn_names <- sidewalks_brooklyn_active %>% distinct(nbh, latitude, longitude) %>% group_by(nbh) %>% summarise(mean_lon=mean(longitude), mean_lat=mean(latitude))
queens_names <- sidewalks_queens_active %>% distinct(nbh, latitude, longitude) %>% group_by(nbh) %>% summarise(mean_lon=mean(longitude), mean_lat=mean(latitude))
bronx_names <- sidewalks_bronx_active %>% distinct(nbh, latitude, longitude) %>% group_by(nbh) %>% summarise(mean_lon=mean(longitude), mean_lat=mean(latitude))
This view shows all of the distribution of all currently active or new sidewalk cafes in Manhattan. To get a better idea of density, we can also look at a heatmap of this view.
g <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active, geom="polygon", alpha=0.3)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in Manhattan")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=nbh),fontface="bold", size=3)
This heatmap shows the highest concentration of sidewalk cafes in Greenwich Village and NoHo. To see the difference between active and new licenses, we can facet based on our previously calculated categorization.
g <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active, geom="polygon", alpha=0.3)+facet_wrap(~STATUS_CLASSIFICATION2)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in Manhattan")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=nbh),fontface="bold", size=3)
The heatmap view of the map is not a great one - it is difficult to tell the differences on the same scales in a heatmap. Looking at the same view, but by using geom_point again, we get the following view:
ggmap(manhattan_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=nbh),data=sidewalks_manhattan_active)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in Manhattan") + guides(color=FALSE)+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=nbh),fontface="bold", size=3)
Once again, the data is not telling us too much since there are very few new requests. Perhaps a better way to categorize this data is to look at the years that applications were submitted.
ggplot(sidewalks_manhattan_active, aes(x=SUBMIT_YEAR))+geom_bar()
Since there are only one submission in both 2000 and 2014, I will be removing these outliers as potential mistakes where the submit dates were not updated once the licenses were renewed. Now we can look at a mosaic plot of the different neighborhoods and the years that the licenses were requested.
sidewalks_manhattan_active<-sidewalks_manhattan_active %>% filter(SUBMIT_YEAR > 2014)
manhattan_counts <- sidewalks_manhattan_active %>% group_by(nbh, SUBMIT_YEAR) %>% summarise(count=n())
manhattan_counts$SUBMIT_YEAR<-as.factor(manhattan_counts$SUBMIT_YEAR)
manhattan_counts$SUBMIT_YEAR<-factor(manhattan_counts$SUBMIT_YEAR, c("2017", "2016","2015"))
ggplot(manhattan_counts, aes(x=reorder(nbh, count), y=count, fill=SUBMIT_YEAR))+geom_bar(stat="identity",position=position_dodge())+coord_flip()+xlab("Neighborhood")
It makes sense that 2016 would have more applications than 2015, since many of the 2015 applications have been renewed by now. However, to get a better understanding of the distribution of areas, we can create three separate heatmaps.
map2015 <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active%>%filter(SUBMIT_YEAR=="2015"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2015")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=nbh),fontface="bold", size=3)
map2015
map2016<-ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active%>%filter(SUBMIT_YEAR=="2016"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2016")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=nbh),fontface="bold", size=3)
map2016
map2017 <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active%>%filter(SUBMIT_YEAR=="2017"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2017")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=nbh),fontface="bold", size=3)
map2017
Across all three years, you can see how the concentrated area of licenses remains in the Greenwich and Village NoHo area. However, it looks like the number of applicants in the Upper West Side in 2017 has dropped.
queens_map <- get_map("Astoria, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Astoria,+NY&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Astoria,%20NY&sensor=false
Listening on http://127.0.0.1:3509